AITopics | vision-language task

In-Context Learning Unlocked for Diffusion Models

Neural Information Processing Systems | Apr-25-2026, 12:25:34 GMT

Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance.

large language model, machine learning, natural language, (18 more...)

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Media (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

23fa71cc32babb7b91130824466d25a5-Supplemental.pdf

Neural Information Processing Systems | Apr-25-2026, 03:28:13 GMT

artificial intelligence, machine learning, natural language, (16 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.75)

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Neural Information Processing Systems | Mar-22-2026, 12:10:19 GMT

Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLMs for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar.
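
The two-stage design described in this abstract can be illustrated with a minimal sketch. This is not the Prism codebase: the function `two_stage_answer`, the type aliases, the prompt wording, and the toy stand-in models are all hypothetical and chosen only to show how perception (image to text) can be separated from reasoning (text to answer).

```python
from typing import Callable

# Hypothetical interfaces: any callables with these shapes can be plugged in.
Perceiver = Callable[[str, str], str]  # (image_ref, instruction) -> textual description
Reasoner = Callable[[str], str]        # (prompt) -> answer


def two_stage_answer(image_ref: str, question: str,
                     perceive: Perceiver, reason: Reasoner) -> str:
    """Answer a visual question by describing the image first, then reasoning over text only."""
    # Stage 1 (perception): a VLM turns the image into a textual description.
    instruction = "Describe the image in detail."  # assumed wording, not from the paper
    description = perceive(image_ref, instruction)
    # Stage 2 (reasoning): an LLM sees only the description and the question.
    prompt = f"Description: {description}\nQuestion: {question}\nAnswer:"
    return reason(prompt)


# Toy stand-ins so the sketch runs without any model.
def toy_perceiver(image_ref: str, instruction: str) -> str:
    return "two red apples on a table"


def blind_perceiver(image_ref: str, instruction: str) -> str:
    return "a blurry scene"


def toy_reasoner(prompt: str) -> str:
    return "2" if "two" in prompt else "unknown"


answer = two_stage_answer("apples.png", "How many apples are there?",
                          toy_perceiver, toy_reasoner)
degraded = two_stage_answer("apples.png", "How many apples are there?",
                            blind_perceiver, toy_reasoner)
```

Because the reasoner is held fixed while the perceiver is swapped, any change in the final answer can be attributed to perception. That is the kind of controlled comparison the modular design is meant to allow.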

artificial intelligence, large language model, natural language, (9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wang, Wenhai, Chen, Zhe, Chen, Xiaokang, Wu, Jiannan

Neural Information Processing Systems | Feb-16-2026, 22:29:37 GMT

It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with …

large language model, machine learning, natural language, (16 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Hong Kong (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

Neural Information Processing Systems | Feb-16-2026, 17:35:29 GMT

Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery)

classification, large language model, machine learning, (21 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
(2 more...)

9a6a435e75419a836fe47ab6793623e6-Paper-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 02:18:59 GMT

large language model, machine learning, natural language, (21 more...)

Country:

Asia > Singapore (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

1b3750390ca8b931fb9ca988647940cb-Paper-Conference.pdf

Neural Information Processing Systems | Feb-8-2026, 12:37:32 GMT

large language model, machine learning, natural language, (18 more...)

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Neural Information Processing Systems | Dec-24-2025, 02:49:16 GMT

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images.
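
To make the idea of aligning image and text representations with a contrastive loss concrete, here is a minimal sketch of a generic symmetric image-text contrastive (InfoNCE-style) loss in plain Python. It is an illustration under stated assumptions, not the ALBEF implementation: the function names, the fixed `temperature=0.07`, and the use of in-batch negatives only are choices made here, and the abstract does not specify any of them.

```python
import math


def _normalize(v):
    """Scale a nonzero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(sim):
    """Mean cross-entropy over rows, where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i (image i, text i) is the positive,
    all other pairs in the batch are negatives. Temperature is an assumed constant."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Image-to-text similarity matrix, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Text-to-image direction is the transpose.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))


# Matched pairs give a low loss; mismatched pairs give a high loss.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each image embedding toward its paired text embedding and away from the other texts in the batch, so the two modalities share a similarity structure before any cross-modal attention is applied.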

momentum distillation, name change, vision and language representation learning, (8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.82)
Information Technology > Sensing and Signal Processing > Image Processing (0.59)
Information Technology > Artificial Intelligence > Machine Learning (0.56)

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Neural Information Processing Systems | Dec-23-2025, 21:33:12 GMT

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning because local receptive fields are weak at modeling long-range dependencies. Thus the two objectives of learning visual relations and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective.

inter-modality, self-attention, visual parsing, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.98)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.39)

Task-Aware Resolution Optimization for Visual Large Language Models

Luo, Weiqing, Tan, Zhen, Li, Yifan, Zhao, Xinyu, Lee, Kwonjoon, Dariush, Behzad, Chen, Tianlong

arXiv.org Artificial Intelligence | Oct-14-2025

Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing that resolution preferences correlate with image complexity and with the uncertainty variance of the VLLM at different image input resolutions. Building on this insight, we propose an empirical formula, combining these two factors, to determine the optimal resolution for a given vision-language task. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.

large language model, natural language, resolution, (15 more...)

arXiv:2510.09822

Country:

North America > United States (0.46)
Europe > Switzerland (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
